A lasso for hierarchical testing of interactions 



Jacob Bien* Noah Simon J and Robert Tibshirani} 

November 7, 2012 



Abstract 

We consider the testing of all pairwise interactions in a two-class problem with many 
features. We devise a hierarchical testing framework that only considers an interaction 
when one or more of its constituent features has a nonzero main effect. It is based on a 
convex optimization framework that seamlessly considers main effects and interactions 
together. We provide examples — both real and simulated- that show a potential gain 
in power and interpretability over a standard (non-hierarchical) interaction test. 

1 Introduction 

We consider the standard two-class problem with yj = 1 or 2 and p features {xn, x i2 , ■ ■ ■ x ip } 
measured on each of i — 1, 2, . . . N observations. Large-scale hypothesis testing for the ef- 
fects of individual features is a challenging problem, and has received much attention in 
recent years (e.g Efron (2010), Dudoit & van der Laan (2008)). The problem of testing 
for interactions between features is even more difficult, as there are (^j interactions, just 
restricting attention to pairs of features. Buzkov, Lumley & Rice (2011) show that standard 
permutation tests cannot be used for interaction testing, and propose instead a parametric 
bootstrap-based approach. Simon & Tibshirani (2012) devise a permutation approach that 
exploits the close relationship between the "forward" logistic model (based on Y\X) and a 
"backward" discriminant analysis (Gaussian) model (based on AT|Y). 

When p is large, the large number of potential interactions can result in low power 
for detecting the true effects. One strategy used by data analysts is to first screen the 
data for significant main effects, and then only test for interactions among those features 
that are themselves significant. This can be an effective approach, but has some drawbacks. 
Specifically, at what threshold does one stop entering main effects? And should this threshold 
depend on the strength of the interactions? 

The above two-stage strategy can be thought of as "hierarchical": interactions are only 
considered if both constituent main effects are significant. In this paper we propose a convex 
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formulation that models main effects and interactions together, in a hierarchical fashion. It 
provides a testing framework that seamlessly combines main effects and interactions. The 
method is closely related to the recently proposed hierarchical lasso regression method ( "hi- 
erNet") of Bien, Taylor & Tibshirani (2012). We focus exclusively on pairwise interactions 
in the paper, but discuss possible extensions to higher order interactions in Section [7J 

2 Testing interactions using a convex formulation 

Before we begin, we will define some notation. Let jji G {1,2} be the class label for 
observation i, and denote the set of indices of the classes 1 and 2 by Cg, £ = 1,2. Let x^- be 
the jth feature for the ith observation. Furthermore let 



E[xij] = (j,j ti for i e C, 



I: 



Cor (xij, x ik ) = — J -^— = p jk)t for i e C t 

where pj^, a^i and &jk,i are f ne class-specific feature means, variance and covariances and 
Pjk t e is the class specific correlation. 

It is reasonable to base a test of the interaction between predictors j and k on a test 
of equality of the correlations. In particular Simon & Tibshirani (2012) show that, in a 
fairly general framework, a test for equality of correlations is also implicitly a test for the 
interaction term in a forward logistic regression model for Y as a function of Xj and X k . 

Hence we are interested in testing the hypotheses: 

H o,j '■ = A*i,2 for 1 < j < p 
H ,jk ■ Pjk,i = Pjk,2 for 1 < j < k < p. 

We test the main effects using the standard t-statistic 



Sjyf\fn\ + l/n 2 

For the interactions we define a new variable: 

z iJk = ^ = ^ {Xlk - ^ for i E Ci, £=1,2 

Sj,eSk,i 

where Sj t g is the usual estimate of the standard deviation for covariate j in class I. This new 
transformed variable is really just a single observation estimate of Pjk,e- As a test statistic 
we use the usual t statistic with this new definition of Zi j^: 

_ Zjk,l — Zjk,2 

3 Sjkyjl/nx + l/n 2 ' 

where Zj k ,t is the sample mean of z.j^ in class I and Sjk is a pooled estimate of the standard 
deviation of z.j k 
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We would like to select interactions based on the size of Zjk, but also somehow give a 
"boost" to interactions whose main effects are large. One could try to achieve this through 
a two-stage procedure, where we first screen the individual features, and then test for in- 
teractions only among those features selected at the first phase. This kind of method is 
explored for example in Kooperberg & LeBlanc (2008) and Hsu, Jiao, Dai, Hutter, Peters & 
Kooperberg (2012). 

Instead, we approach this problem through a joint optimization involving both Wj and 
Zjk. Consider minimizing the function 

Lx lM ({#}, {/?"}, {0 jh }) = ^(» r (?-^)f + ^Efe-^+ 

3=1 j=l j^k 

*i£[# + #] + *a£Z>*l (!) 

j=l j=l kjtj 

subject to the constraint that (3f > (here we think of the main effect as (3j = (3+ — fij). 

For a fixed Ai and A2, the solutions /3j and 9j k are soft-thresholded versions of Wj and Zj k - In 
a sense, this criterion "selects" main effects for which \wj\ > X\ and interactions for which 
\ z jk\ > A2. Such a procedure, however, does not share information between main effects 
and interactions. Our proposal in this paper is to add a convex hierarchy constraint to the 
problem, which will lead to main- effect "informed" thresholds for testing the interactions 
(and likewise interaction "informed" thresholds for testing main effects). Define 



(/3+,/T, 0) = argminL Al , A2 ({/?+}, {(3f}, {9 jk }) subject to (3f > 0, £ |0 ifc | < 0? + (3~. 

k 

(2) 

While we could use two separate parameters, Ai and A2, we have found that Ai = A2 = A 
is a good choice, and use it throughout. Our idea is to fit a path of models (parameterized 
by A) and then define the test statistic for the jkth interaction to be A' fc , the largest A for 

which either 6jk or 9kj is nonzero. In the same way, for each main effect j we compute Xj, 
the largest A for which either /3j~ or (3~ is non-zero. That is, letting /3(A) = /3 + — $~ and 
9(\) denote the solution for Ai = A2 = A, our proposed test statistics are 



Xj = sup{A > : 0j{X) y£ 0} 

X' jk = max{A jfc , X kj } 

where 

X jk = sup{A > : 9 jk (X) 0} 

X kj = sup{A>0: 9 kj (X)^0}. (3) 
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In Lemma [T] of the Appendix, we prove that ([2]) has a unique solution for each A > 0, 
so Xj and Xj k are well-defined. Without the hierarchy constraints in (j2J), we would have 
Xjk = \Zjk\ and Xj = \wj\. The addition of the constraint imposes a "budget" Bj~ + 8~ on the 
total interactions that involve feature j. In particular, the constraint \9j k \ < + (3~ 
implies that at least one of and Q~ must be non-zero, in order for 9j k to be non-zero. 

Although in theory we could have = 3~ with both values positive, this happens with 
probability zero under reasonable assumptions. 

As a result, 9j k 7^ implies Bj 7^ 0, and similarly for 9^. Thus the jkth interaction 
is nonzero if at least one of Bj or Bk is nonzero. Bien et al. (2012) (and others) call this 
property weak hierarchy, in contrast to strong hierarchy, which requires both Bj and (3 k to 
be nonzero, in order for 9j k to be nonzero. In terms of our test statistics, weak hierarchy 
implies that 

X' jk < max{A i; X k }. 

We note also that problem (j2J) is convex, due to the fact that we have represented each 
main effect B~ as the difference of two non-negative quantities Bj, Bj. It would not be convex 

J J J 

if we had used \Bj\ in place of Bj + B~ in the constraint above. 

We call our method convex hierarchical testing, or sometimes "CHT" for short. The 
corresponding version of this proposal without the hierarchy constants, we call the all pairs 
method. This approach simply orders the Zj k by their absolute values. 

Note also that Simon & Tibshirani (2012) find that the Fisher transform of the correlation 
can improve the performance of the all-pairs test. We don't use that here for CHT, since it 
affects the relative scaling of main effect and interaction test statistics. 

Example. We simulated Gaussian data with N = 200, p = 5 and strong main effects 
for the first two variables and strong interactions for variables (1,3) and (2,5). Thus the 
true model obeys (weak) hierarchy. Figure [2] shows the test statistic A for each main effect 
and interaction (horizontal axis) and the corresponding z score (wj or Zjk) on the vertical 
axis. We see that the procedure enters variables 2, 1, interactions (1,3), variable 3 and 
then interaction (2,5). The latter interaction is entered before (4,5), which has a (slightly) 
larger z value and so would be entered first by the all-pairs procedure. This example is 
only for illustration, but we demonstrate later through simulations that convex hierarchical 
testing exhibits higher power than the all pairs approach, when the true underlying scenario 
is hierarchical. 

3 Details of the procedure 

Although the ranking of interactions from the above procedure comes from a seemingly 
complicated optimization problem, the solutions actually have a simple form. In particular, 
we show in the Appendix that 

9 jk =S(z jk , X + aj) (4) 

Here S(x,t) = sign(x) • (|x| — 1) + (the soft-thresholding function), and the value <x,- G [0, A] 
emerges from the solution to problem (j2J), with atj = if the hierarchy constraint Ylk \@jk\ ^ 



4 




Figure 1: Convex hierarchical testing applied to simulated example. Plot shows the test statistic A 
for each main effect and interaction (horizontal axis) and the corresponding z score (wj or Zjk) on 
the vertical axis. 



@t @j * s l° ose (i- e -> a strict inequality) and greater than zero otherwise. 
We interpret these conditions as follows: 

• djk can only be non-zero if at least one of and (3~ is non-zero. 

• if there are large main effects and/or then the hierarchy constraint will tend 
to be loose and hence &j — 0. Then Zjk is simply soft-thresholded by A. If this is true 
for all j, then CHT simply chooses all interactions Zjk larger than A in absolute value. 
This is the same as in the all pairs approach. 

• If the main effects are not very large, then the hierarchy constraint will be tight and 
hence 6tj may be larger than zero. Then from (j3J) we see that a higher threshold is 
applied to Zjk,&nd it may not be selected. In this sense, a more stringent screening rule 
is applied to Zjk if its associated main effects j and k are not large. 

Recall that our test statistic for interaction jk is computed as the largest value of A such 
that 9jk or 9^ is nonzero. The following lemma adds justification for this test: 

Property. Let 9jk(\) solve (j2J) as a function of A, with Ai = A2 = A. Then |#jfc(A)| is a 
non- increasing function of A. 

This property is proved in the Appendix (Proposition [2]), and implies that once an inter- 
action is selected, it remains selected for all smaller values of A. 
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In fact, in the Appendix we are able to derive (after much work!) an explicit formula for 
the test statistics Xj and Xjk- 



Proposition. The main effect and interaction test statistics have the following closed- 
form expressions: 



Xj = max < \vjj\, 



\ ~t~ 11^7- 1| oo 



l \ z jk\ 

X jk = mm ^ \z jk \, + 



J + 



(5) 

where Zj. = {zj k : k ^ j} G R p_1 is the vector of interaction contrasts involving the jth 
variable. 



Proof. See Proposition H] in the Appendix. □ 

These formulae are somewhat complex, but we can interpret them loosely as follows. 
Each main effect is "boosted" by the size of the largest interaction in its row, due to the 
hierarchy constraint. In contrast, each interaction is shrunken by as much as half of its 
size, with the shrinkage amount less when the main effect is large or the interaction is large 
relative to the other interactions in that row. Interestingly, Xjk depends only on Wj and 
those interactions in the jth row that are at least as large (in absolute value) as Zjk- Figure 
|2] gives a graphical illustration of these formula. We set the number of interactions to 50. 
In the left panel the interaction contrasts Zj are generated as iV(0, 1). The plot shows the 
test statistic Xj as a function of Zj and the main effect w (different colored curves with main 
effect indicated), along with the 45° line. We see that the interaction effect is shrunken 
substantially until it reaches about 2.75, and the amount of shrinkage is less when the main 
effect is larger. In the right panel there are p — 1 = 49 small interactions distributed as 
iV(0, .5 2 ) and one large interaction whose value varies along the horizontal axis. Now we see 
that there is shrinkage only about till a value of 1.5, and a main effect of 1.5 is sufficient to 
ensure no shrinkage at all. 

4 A simulation study 

We simulated some Gaussian data with N = 200 observations and p = 50 features in 
two classes y — 1,2. There are three different scenarios. In the "Hierarchical truth" setup, 
we randomly chose five features to have strong nonzero main effects, and then 9 of their 
associated interactions to be nonzero, with moderate-sized effects. In the "Non-hierarchical" 
setup (also referred to as "No main effects" ) there are no main effects, and simply 45 non-zero 
interactions chosen at random. Finally, in the "Anti-hierarchical truth" setting, the effects 
are the same as in the hierarchical setup, except that the interactions are concentrated 
exclusively on variables having no main effects. We compared the convex hierarchical and 
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Gaussian 



Mixture 




Figure 2: Graphical illustration of formula |5l) for two different distributions of interactions (two 
panels) and different size of main effects w (colored lines). Broken line is the 45° line. Full details 
in text. 

all-pairs testing procedures, along with two different two-stage screening methods: in the 
"strong" version we retained all main effects with z scores above the 75th percentile, and 
then in the second stage tested for interactions only among the retained variables. In the 
weak version, we considered all interactions among pairs of variables where ar least one 
variable had a z score above the 75th percentile. 

Figure 3 shows the true false discovery rate for testing interactions, averaged over 20 
simulations. In the hierarchical scenario, we see that convex hierarchical testing shows a 
substantial improvement in FDR over all-pairs, with the weak screen method performing 
a little worse. In the non-hierarchical scenario, CHT performs about the same as the all 
pairs method, while the screening methods do poorly. Not surprisingly, the false discovery 
rates are higher overall in the non-hierarchical scenario. In the anti-hierarchical setting, not 
surprisingly, convex hierarchical testing does poorly, with the weak screening perhaps a little 
better. 

In Figure HJ we varied the strength of the main effects in the hierarchical setting and 
compared all-pairs and CHT in their ability to correctly detect interactions. We estimated 
the average number of non-null interactions called significant (over 20 replications) with 
FDR< .2, for varying sample sizes (horizontal axis) and size of the main effect. There are 
45 non-null interactions in the underlying model. In the top panel, the all-pairs method 
does the best, but requires close to 1000 samples to detect all 45 interactions. In the middle 
panel, with moderate main effects, CHT has captured all 45 effects with just 500 samples. 
In the hierarchical scenario, the problem is more tractable overall: with say 200 samples, 
CHT captures about 25 effects in the middle scenario, while the all-pairs method captures 
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Hierarchical truth 



Q 




All pairs 
CHT 

Screen: strong 
Screen: weak 



Number called significant 



No main effects 



rr 




Number called significant 



Anti-hierarchical truth 




20 40 60 80 100 

Number called significant 



Figure 3: True false discovery rates of four different procedures over three different settings. The 
convex hierarchical test has lowest FDR when the true effects are hierarchical (top panel), with 
the weak screen method performing a little worse. CHT performs about the same as the all pairs 
methods when there are no main effects (middle panel), while the screening methods do worst. In 
the bottom panel, there are strong main effects and interactions, but the interactions occur for 
predictors without main effects, and CHT performs badly. The screening methods perform poorly 
as well, except for weak screening for a small number of features. 
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Predictor number 



Name 



Predictor number 



Name 



1 

2 
3 
4 
5 
6 
7 
8 
9 

10 
11 
12 
13 



Reached menopause? 



14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 



PTPNli4INV 
CypllB2xlINV 
PTPNlx9INV 
ADRB3W1R 
KLKQ3E 
AGT2R1A1166C 
AVPR2G12E 



insulin t=-10 
insulin t=60 

insulin t=120 
HUT2SNP5 
HUT2SNP7 
BADG16R 



AVPR2A1629G 
AGT2R2C1333T 



MLRI2V 
AGTG6A 



PPARG12 
CD36x2aINV 
MLRi6INV 



CypllB2-5paINV 



PTPNlil 
PTPNH4 



CypllB2i4INV 



Table 1: List of predictors in the Sapphire dataset. 



only about 15 in the top panel. Increasing the size of the main effect in the bottom panel 
seems to make little difference. 

5 Real data example- SAPPHIRe study data 

This data was analyzed in Park & Hastie (2008), following the study of Huang, Lin, 
Narasimhan, Quertermous, Hsiung, Ho, Grove, Olivier, Ranade, Risch & Olshen (2004). The 
study was aimed at finding genes associated with hypertension. The dataset consists of the 
genotypes on 21 distinct loci and the menopausal and insulin resistant status of hypotensive 
and hypertensive Chinese women (with 216 and 364 subjects, respectively). The predictors 
are listed in Table [TJ 

The first four (non-genetic) predictors have the strongest effects on their own, although 
none were significantly different across the two groups (details not shown). Table |2] shows 
the first ten interactions found by the all-pairs and convex hierarchical test methods. The 
selected interactions are very different, with CHT focussing on clinical by genetic interaction, 
due to the strength of the clinical factors on their own. Figure depicts the main effects 
and interactions found by CHT for different values of the regularization parameter A. 

To examine the stability of the two analyses, we carried out a bootstrap analysis, record- 
ing the top ten pairs appearing in the analysis from each of 500 bootstrap samples. The 
ten most frequently occurring interactions are shown in Table [3] We see that the CHT test 
shows a moderately higher reproducibility among the top ten selected interactions. 

Figure 6 contains another investigation into the reproducibility of the two different meth- 
ods. We divided the data many times at random into two approximately equal sized parts, 
and recorded how many interactions appeared in both top k lists, for each method. In the 
figure, k varies along the horizontal axis. Again we see that the overlap is larger for the CHT 
method, although this overlap is fairly small overall. 
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Main effect= 






Figure 4: Average number of True Positives (non-null interactions) called significant (over 20 
replications) with FDR< .2, for varying sample sizes (horizontal axis) and size of the main effect. 
True number of non-null interactions is 45- 
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All pairs 
Predictor 1 



CypllB2xlINV 
PPARG12 
PPARG12 
MLRi6INV 
CypllB2i4INV 
CypllB2xlINV 
AGTG6A 
Affected status 
Reached menopause? 
AVPR2A1629G 
Convex hierarchical test 



Predictor 2 
~AGTG6A 
ADRB3W1R 
CD36x2aINV 
CypllB2xlINV 
AGTG6A 
AVPR2G12E 
PTPNlil 
KLKQ3E 
HUT2SNP5 
PPARG12 



Reached menopause? 
insulin t=-10 
Affected status 
Affected status 
Reached menopause? 
Reached menopause? 
insulin t=-10 
Reached menopause? 
insulin t=60 
insulin t=60 



HUT2SNP5 

MLRi6INV 

insulin t=60 

KLKQ3E 

MLRi6INV 

PTPNlx9INV 

ADRB3W1R 

AGTG6A 

CD36x2aINV 

MLRi6INV 



Table 2: Top ten interactions found by all pairs and convex hierarchical test methods. 
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All pairs 




Predictor 1 


Predictor 2 


Bootstrap frequency 


KLKQ3E 


CD36x2aINV 


0.46 


CypllB2-5paINV 


PTPNlx9INV 


0.43 


PTPNli4 


CypllB2-5paINV 


0.32 


MLRI2V 


PTPNlxMNV 


0.31 


MLRi6INV 


CD36x2aINV 


0.27 


AGT2R1A1166C 


Reached menopause? 


0.26 


PTPNlx9INV 


CypllB2i4INV 


0.25 


CD36x2aINV 


AGT2R2C1333T 


0.21 


CypllB2i4INV 


insulin t=60 


0.2 


PTPNli4INV 


BADG16R 


0.2 




Convex hierarchical test 


AGT2R1A1166C 


Reached menopause? 


0.49 


HUT2SNP7 


insulin t=-10 


0.44 


KLKQ3E 


insulin t=60 


0.41 


insulin t=120 


Reached menopause? 


0.4 


CypllB2i4INV 


insulin t=-10 


0.39 


ADRB3W1R 


insulin t=-10 


0.37 


CypllB2i4INV 


insulin t=60 


0.35 


KLKQ3E 


CD36x2aINV 


0.28 


CypllB2-5paINV 


insulin t=-10 


0.25 


CypllB2i4INV 


insulin t=120 


0.23 



Table 3: Ten most frequent interactions found by all pairs and convex hierarchical test method 
over 500 bootstrap replications. 
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4 parameters 7 parameters 9 parameters 




15 parameters 15 parameters 22 parameters 




32 parameters 42 parameters 65 parameters 




Figure 5: Convex hierarchical testing: main effects (black dots) and interactions (edges), for 9 
different decreasing values of A. Weak hierarchy ensures that each edge is incident to at least one 
black dot. 
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Figure 6: Proportion of interactions found in both random halves of the data (vertical axis), as the 
total number of interactions selected is varied (horizontal axis). 



6 Estimation of the false discovery rate 

Permutations provide a convenient and robust way to estimate false discovery rates in 
large-scale hypothesis testing. For example Simon & Tibshirani (2012) devise a permutation 
scheme for the all pairs interaction test. In this scheme, one randomly assigns a component 
of the interaction contrast to group 1 or group 2, by flipping the sign of the component at 
random. 

This scheme can be easily adapted to the present setting. Here are the details. The idea 
is to retain the main effect contrasts Wj from the original fit, and create randomized versions 
of the interactions as follows 



z jk = i^2( x ij ~ x 2j)(%ij ~ a?2jt)/ (sj,es k/ ) - ^2 ( x v - x ij)( x ij ~ x ik) / ' {sj,tSk,e)]/ s jkVVn~i - 

where C£, are the classes from a permutation y* of the class labels y. We fit the model to 
the contrasts (wj, z* k ) to obtain b , for B permutations b = 1, 2, ... B. Finally, we estimate 
the FDR as 

FDR(X) = E ^ fe/(A ^ >A)/i? (6) 

e^(a;,>a) 

Note that this estimate pools the null distributions from all j, k pairs. This kind of pooled null 
is commonly used, for example in the SAM procedure (Tusher, Tibshirani & Chu 2001) and 
the aforementioned interaction test of Simon & Tibshirani (2012). Its accuracy is quite high 
in simulation studies, although we know of no rigorous results on its asymptotic properties. 
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Hierarchical truth 




Number called significant 



No main effects 




Number called significant 



Figure 7: Estimation of FDR for convex hierarchical testing using permutations. Result is 
average over 30 simulations, with average standard error bars shown on right. Vertical line 
drawn at true number of non-zero interactions. 
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Figure 7 shows the estimated FDR from this method for our example earlier. The es- 
timate is fairly accurate, but tends to over-estimate the true FDR by a moderate amount. 
This may be due to the interdependence of the test statistics Xjk for each j. In future work, 
it would be important to study the theoretical properties of this permutation estimate. 

7 Discussion 

We have proposed a hierarchical method for large-scale interaction testing, that biases 
its search towards interactions exhibiting at least one moderate main effect. There are 
a number of ways that this work could be generalized. We have focussed exclusively on 
pairwise interactions: extensions to /c-way interactions, for k > 2 would bound the sum of 
such interactions by the size of the k — 1 order effect. With appropriate definitions for the 
interaction components zjk, one could also apply this procedure to interaction testing for the 
proportional hazards model in survival analysis. 

Acknowledgements. Robert Tibshirani was supported by National Science Foundation 
Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183. 

A A detailed look at the optimization problem (Ej) 

A.l Roadmap for appendix 

Our procedure is based on the following optimization problem: 

Minimize \ - (# - Pj)f + £(z,-» - 9 jk f + X 1 J>+ + f)j] + A 2 £ £ \9 jk \ 

3=1 j=l j^k j=l j=l k^j 

s.t. #>o,£;i0 ifc | <p}+pjio-rj = i,...,p. 

k+3 

Observe that this problem decouples into p separate problems, one for each j, involving the 
variables f3j, 9j.). For notational simplicity, let w = Wj, z = (zji, . . . , Zjj-i, Zjj+i, ■ ■ ■ , z jp) T i 
^ = (5^, and 9 = (9ji, . . . , 6j,j+i, ■ ■ ■ , 6jp) T - The jth problem, whose solution is 

is 

Minimize^, eeRP -x ±(w - - /3")) 2 + \\\z - 9\\ 2 + X 1 (f3 + + /T) + A 2 ||0|| 1 (7) 

s.t. fi± > 0, l^lli < P + +P~. 

Because our original problem decouples into p separate problems involving the j'th main 
effect and the associated p — 1 interactions with that variable, we will study problem (J7|) 
throughout Appendix [A] Once this "jth-row" problem is well-understood, we will by direct 
extension have studied the original problem. Therefore, for the rest of Appendix [A] the 
"inputs" to the problem will be thought of as w G M and z G IR P_1 and the solution is 
denoted by 9) G M 2 " 1 "^" 1 ^, which in the rest of the paper is denoted by ((3^, f3j , 9j.). 

In Appendix [Bj we apply the results of Appendix [A] to the paper's main problem. Here is 
an overview of the main elements of Appendix [Aj 
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• We write out the Karush-Kuhn Tucker (KKT) conditions for problem (J7J) and observe 
that this leads to a very simple characterization of the (primal) solution in terms of 
the optimal dual variables (7 + ,7~,d). 

• We prove that the solution is unique (Lemma [1]). This is important for establishing 
that our test statistics are well-defined (since they are defined as the largest A for which 
a primal variable becomes nonzero). 

• We characterize the solution path in Proposition [TJ It turns out that the path is 
conveniently described by distinguishing among three cases: the first is the "Big main 
effect" case, in which w > \\z\\i, the second is the "Moderate main effect" case, in 
which || a; || oo < w < \\z\\i, the third is the "Big interaction" case, in which w < ||z||oo- 
The path is described in terms of some special values of A that are defined in Lemma [2j 
This Lemma breaks down by case whether these values are finite and if so the relative 
size of these values. The description of the path in Proposition [1] is not completely 
closed-form, but it turns out that our characterization of the path is precise enough 
for our purposes. 

• In particular, our test statistics only depend on when the solution paths become 
nonzero. Proposition [3] describes these values in closed form and Corollary [TJ which is 
the ultimate goal of Appendix [S] provides a very simple expression for the points at 
which the solution path becomes nonzero. 

A. 2 KKT conditions, a primal-dual relation, and uniqueness 

The Lagrangian is 

L(/3± 9; 7 ± , a) J-{w - (/3+ - /T )) 2 + l -\\z - Qf + (A 2 + a)||0||i 
+ ( Al _ 7 + _ a )p+ + ( Al - j- - a )/r , 

where 7 ± ,a > are dual variables corresponding to the constraints. The KKT conditions 
are 

/3~ - w) + Ai - 7 + - a = 
(3~ — w) + Ai — 7~ — a = 
6 - z+ (A 2 + a)s = 
7 + /3 + = 
7"/3- = 

a(\\9\\ 1 -(3 + -r) = 
0\\i<P + + p-; /3 + ,/T>0 



(/? + - 
-(/? + - 
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where Sk is a subgradient of \9k\ evaluated at the solution. The stationarity conditions (first 
three lines) imply that 

/3 + - j3- = w + (j + - j~)/2 
9 = S(z, \ 2 + a) 

Lemma 1. Assume Ai > 0. Then, the solution to ([7]) is unique. 

Proof. Let ((3 + ,(3~,9) be a solution to (J7J with associated dual variables (d,^). Since (J7J 
is convex, it suffices to show that the solution is unique in a neighborhood around this point. 
Noting that the objective function is strongly convex in all directions except for (fi + ,j3~,6) 
for which j3 + — j3~ = (3 + — 0~, it remains to consider perturbations of the optimal point of 
the form (3~ , 9) = + e, /3~ + e, 9 + 5). If this new point is a solution, it must have a 
corresponding set of dual variables (7 + ,7 _ ,a) satisfying the KKT conditions. We begin by 
showing in all cases that e ^ implies 9 = 9: 

• /3 + > 0,/3~ > 0: In this case, 7 + = 7~ = and d = Ai. Choosing |e| small enough 
such that f3 + , f3~ > 0, we must have 7 + = 7~ = and so a = Ai. It follows that 9 = 9. 

• (3 + > 0, (3~ = 0: In this case, 7 + = 0, and e > for our perturbation to be feasible. 
If e > 0, then > and so 7+ = 7" = and a = X 1 . Now, [3+ - (3~ = 

so the first KKT condition for this new point is — w) + \\ — a = 0; however, the 
first KKT condition for our original point is — w) + Ai — a = 0. This implies that 
a = a, and so 9 = 9. 

• (3 + = 0, (3~ > 0: Identical argument to previous case. 

• (3 + = (3~ = 0: Again, e > is required for new point to be feasible. If e > 0, then 
7 + = 7" = and a = X\. On the other hand, the KKT conditions for the original 
point implies a < Ai — \w\ (since ±w + Ai — a = 7 ± > 0). Thus, a > a. Now, 9 = 
means that ||z||oo < A2 + « < A2 + a, and so 9 = as well. 

Thus, our new point is ((3 + ,(3~,9) = (f3 + + e, f}~ + e,9), and so the objective value at the 
two points differs by 2Aie. It follows that if e 7^ 0, then one of the points is not optimal. 
Therefore, ([7]) has a unique solution. □ 

A. 3 Characterizing the path 

We take Ai = A2 = A, and now consider the path of solutions generated by varying A > 0. 
Note that by Lemma (TJ it makes sense to speak of "the" path. We assume, without loss of 
generality, that w > 0. In studying the path, as we do in the proof of Proposition [H we find 
that there are three "special" values of A at which the behavior of the path changes. These 
are the following: 

Ai = min{A > : \\S(z, A)||i + A < w}, 
A 2 = max{A > : \\S(z, A)||i + A < w}, 
A 3 = max{A > : \\S{z, 2A) ||i > w}. 
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In Lemma [2], we collect some facts about Ai,A2,A3 that will be useful in the rest of the 
appendix; however, most readers may prefer to skip directly to Proposition [TJ which describes 
the solution path. 

Lemma 2. Defining \z\\ > |z 2 | > ■ ■ • , the following statements hold: 
1- \\z\\<x> — w iff- Ai, ^2 ar e finite, in which case Ai < \z%\ and A2 = w. 

2. \\z\\\ > w > iff. A3 finite, in which case A3 < (1 — iw/||z||i)||z||oo/2. 

3. If || 2: || oo < w < \\z\\i, then A3 < Ai < A2. 

Proof. The function /i(A) = \\S(z, A)||i + A is piecewise-linear, convex, and minimized on 
A G [|^2 1 , l^i |] with minimal value |zi| (where |zi| > |z 2 | > •••)■ Thus, [Ai,A2] defines a 
nonempty interval iff. ||z||oo < w. If this holds, then Ai < |z 2 | and A2 = w. Likewise, A3 is 
finite iff. ||^||i > w. 

The function /2(A) = \\S(z, 2A)||i is a piecewise linear, decreasing convex function with 
^2(0) = \\z\\i and / 2 (||2;|| 0O /2) = 0. Since / 2 is convex, it lies beneath the line L(X) = 
\\z\\i — 2(||2r||i/||2;|| 0O )A on the interval A G [0, H^Hoo/2] and so A 3 < A where L(X) = w, i.e. 
A = (1 - u;/||z||i)||z|| 00 /2. 

Finally, observe that f^X) - / 2 (A) = ||S(z,A)||i - \\S(z, 2X)\\ 1 + X > X since ||S(z,-)l|i is 
a non-increasing function and 2A > A. Thus, for A > 0, /i(A) > /2(A) from which it follows 
that A 3 < Ai. We can only have equality if A x = A 3 = 0, in which case \\z\\i = ||z||oo — w (i- e - 
z has no more than one nonzero value) , which is a probability event under most reasonable 
models. □ 

Proposition 1. Let Ai, A 2 , A3 be as defined above, let A4 = (w + ||z|| 00 )/2, and write (3 = 
P + — P~ . The solution path of ([7]) depends on the relative size of the main effects and 
interactions and is given by the following: 

1. Big main effect: ||z||oo < < w. 

X > w p = 0, 9 = [Case III] 

Ai<A<w=^/§ = w-A, 9 = S(z, A) [Case I(i)] 

A<Ai (3 = w-X + a(X), 9 = S(z, X + d(A)), a(X) > [Case I(ii)(a)C] 

2. Moderate main effect: ||z||oo < w < ||z||i. 

X>w => j3 = 0, 9 = [Case III] 

X x <X<w => j3 = w-\, 9 = S(z, A) [Case I(i)] 

A 3 <A<Ai p=w-X+ d(A), 9 = S(z, X + a(X)), a(X) > [Case I(ii)(a)B.] 

A < A 3 => j3 = w, 9 = S(z, 2A) [Case II] 
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z \\oo < Iklli- 

0, 6 = [Case III] 

w - A + d(A), = S(z, A + d(A)), d(A) > [Case I(ii)(b) + (a)A.] 

fl = S(z,2A) [Case II] 

Proof. We partition the solution path as follows: 



Case I. 




>oJ- 


= 


Case II. 




>oJ- 


> 


Case III. 




= oJ- 


= 


Case IV. 




= oJ~ 


> 



By Lemma (TJ this is a well-defined partition of the path (i.e., for a given A > 0, only one of 
the following cases holds). 

Case I. $+ > 0, p~ = 0. 

The KKT conditions imply that 7 + = 0, (3 + = w — A + a, and 7" = 2(A — a). 
The conditions j~, a > and f3 + > require that 

0<d<A a > \ — w 

and, defining f\(a) = \\S(z, A + a) ||i — w + A — a, they also require that 

f x (a) < with af x (a) = 0. 

Thus, this case occurs iff. ( "if" direction by Lemma [1]) there exists at satisfying 
these constraints. We consider two subcases: 

(i) a = 0. Requires A — w < and \\S(z, A)||i + A < w. In light of Lemma El 
this subcase happens iff. ||z||oo < w and A G [Ai,w). 

(ii) a > 0. Requires f\(a) = 0. Now, / A is a strictly decreasing function, so to 
show that such an a exists with [A — w] + < a < A, it suffices to check that 

f(a) /a(0) > > /a (A) if A-u;<0 

\(b) A(A-w) >0>/ a (A) ifA-u;>0 

Now, > / A (A) = \\S(z, 2A)||i— w, so by LemmaEJ this constraint is satisfied 
for all A if ||z||i < w and for A > A3 when ||z||i > w. 



3. Big interaction: w < \\ 

A > A 4 =^ p = 
A 3 < A < A 4 => p = 
A < A 3 =^ P = 
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(a) / A (0) > -<=^ \\S(z, A) ||i + A > w, which (in light of Lemma [2]) is 
satisfied for all A when ||z||oo > w and for A ^ [Ai, A 2 ] when ||z||oo < w. 
Incorporating this with the constraints from > j\ (A) and A < w (and 
using that ||;z||oo < ||^||i and A 2 = w) gives the following cases in which 
(a) holds: 

A. || 2; || oo > w, A3 < A < w, 

B. Halloo < w < \\z\\i, A 3 < A < Ai 

C. ||z||i < w, A < Ai 

(b) We have A > w and < /a(A — w) = \\S(z, 2A — iu)||i, which holds as 
long as Halloo > 2A — w, i.e. A < (||^||oo + w)/2. 

A. H^lli < w < A < (Halloo + w)/2. But this case can never occur since 
it would imply max{w, ||z||i} < (||-z||oo + w)/2< (||z||i + w)/2. 

B. || 2 ||i > w, A > A 3 , w < A < (||z||oo + w)/2, 

Case II. /3 + > 0, f3~ > 0. 

In this case, 7" 1 " = 7" = 0, so we have (3 + — (3~ = w and a = A. Now, f3 + = 
w + f3~ > w and so /3 + + j3~ > w. The only remaining requirement of the KKT 
conditions is that \\S(z, 2A)||i = (3 + + f3~ > w. This case occurs iff. (again, we use 
Lemma[T]for the "if" direction) \\S(z, 2A)||i > w, which by Lemma[2]is equivalent 
t° || z ||i > w an d A < A3 (note that A = A 3 could only occur when w = 0, a case 
we exclude). 

Case III. /3+ = 0,/3" = 0. 

In this case, ^ = A ± w — a > 0, and so we require < a < A — w with 
|| S(z, A + a) ||i < 0. Putting this together gives 

[Iklloo - A] + < a < A - w. 

This interval is nonempty iff. [||z||oo — A] + < A — w, i.e. 

A > max{w, (||^||oo + w )/2}- 

It is interesting to break this /3 + = f3~ = case into two subcases: the case in 
which the everything is zero even without the hierarchy constraint and the case 
in which the hierarchy constraint is the "reason" for everything being zero (in 
other words a > and we wouldn't be in this case if a = 0). The hierarchy- 
active case occurs when ||z||oo — A > 0. In light of the lower bound on A given 
above, active hierarchy sets everything to zero specifically when ||^||oo > w and 
A> (IklU + w)/2. 

Case IV. /3 + = 0, j3~ > 0. 

The KKT conditions imply that 7" = 0, j3~ = a — (w + X) and 7 + = 2(A — a) > 0. 
Putting these together gives a < A and (3~ < —w < 0. In other words, this case 
does not occur! 
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We summarize these cases more succinctly: 

Case I. 0+ >0,f}- = 0. 

(i) a = iff. Halloo < w and Ai < A < w. 

(ii) A > iff. 

(a) A. Halloo > w, A3 < A < w, 

B. || 2; || oo < w < \\z\\x, A 3 < A < Ai 

C. ||z||i < w, A < Ai 

(b) ||z||i > w, max{u>, A3} < A < (H^Hoo + w)/2, 

Case II. (3+ >0,(3~ > 0. 

iff. ||z||i > w and A < A 3 (note that A = A 3 could only occur when w = 0, a case 
we exclude). 

Case III. /3 + = oJ~ = 0. 
iff. 

A > max{w, (Halloo + w)/2}. 
The hierarchy- active case occurs iff. ||z||oo > w an d ^ > (Halloo + w )/2- 

Case IV. /3+ = 0, j3~ > 0. 

Does not occur! 

By rearranging these cases depending on the relative sizes of w, ||z||ocn an d ||^||i, we get the 
paths given in the Proposition statement. □ 

Proposition 2. The solutions \8j(X)\ are non-increasing in A. 

Proof. Based on Proposition [I], this statement is immediate except when a > 0, i.e. Case 
I(ii). Referring back to this case in the proof of Proposition [TJ let d(A) be the uniqu^j] point 
for which f\(a(\)) = 0. Defining t(A) = A + a(A), we have 

A(a(A)) = \\S(z,t(\)\\ 1 -t(\) + \ = w. 

This must hold for A over the range in which d(A) > (as specified in Proposition [1]). Since 
\\S(z, t)\\i — t is a strictly decreasing function of t, it follows that increasing A requires an 
increase in t(X) for A. Since 9 = S(z,t(X)), this proves that \9j\ is nonincreasing in A. □ 

1 Since fx is strictly decreasing, it has a unique root. 
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A. 4 Where do the paths become nonzero? 

The results from the previous section provide sufficient information for us to derive an 
exact, closed- form expression for the point in the path at which each coefficient becomes 
nonzero. 

Proposition 3. Recall that Ai = min{A > : \\S(z, A)||i + A < w}. The main effects and 
interactions become nonzero at 

v = sup{A : \f}\ ^ 0} 
v k = sup{A : \9 k \ ^ 0}, 

which have the following "closed form" values: 

1. Big main effect: ||z||oo < ||^||i < w - 



for \z k \ > Ai 
11^,1^1)110/2 for \z k \ < Ai 

2. Moderate main effect: ||-z||oo < w < \\z\\i. 

for \z k \ > Ai 
II^I^DH^^ /or |^.| < A x 

3. Big interaction: w < ||z||<x> — \\ z \\i- 

v = (w + p||oo)/2 
h = \z k \/2+[w-\\S(z,\z k \)\\ 1 } + /2 



V = w 




\ z k\ 

(\z k \ + W - 



V = w 



\z k \ 

\z k \/2+[w 



Proof. The expressions for v and v k follow from Proposition [TJ The only parts that are not 
immediate are the cases in which a (A) > 0. 

We begin by considering the "Big main effect" when A < Ai. By definition of v k and by the 
expression for 9 in this case, we see that \z k \ = u k + a{u k ). We also know that fv k {pi{pk)) — 
(since a{y k ) > 0). Putting these together gives \\S(z, |^fc|)||i — w + 2i> k — \z k \ = 0, which we 
solve for v k . We also require that a(z>0 > and v k > 0, i.e. that < v k < \z k \. Notice 
that \z k \ < Ai implies that \z k \ > w — \\S(z, |^fe|)||i, establishing that k < \z k \ for \z k \ < A x . 
In the "Big main effect" case, it is easy to see that {\z k \ + w — \\S(z, |zfc|)||0/2 > (which 
implies that v k > 0, as required). This completes the proof for the "Big main effect" case. 

In the "Moderate main effect" case, if \z k \ < A 1; we know that one of two cases can 
occur. The logic proceeds identically to the "Big main effect" case, except that we are not 
guaranteed that (\z k \ + w — \\S(z, |zjfc|)||0/2 > 0, which must hold for us to have v k > 0. 
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If this does hold, we're done. Assume that instead (\z k \ + w — \\S(z, |.Zfc|)||i)/2 < 0. This 
implies that \\S(z, |^fc|)||i > w or equivalently that \z k \/2 < A3. This means we're in Case II 
and so k = \z k \/2. 

Finally, we consider the "Big interaction" case. By Proposition [TJ it is clear that k < A4 
and the expression for k will depend on whether k > A3. As above, if k > A3, then we 
would have k = (\z k \ + w — \\S(z, |zfc|)||i)/2, which to be valid must fall within [0, \z k \) (to 
ensure that k > and a{0 k ) > 0). These requirements simplify to \z k \ > w — \\S(z, |.Zfc|)||i 
and \z k \ > —(w— \\S(z, |^fc|)||i)- This first inequality always holds in this case since, as noted 
in the proof of Proposition [TJ /i(A) = \\S(z, A)||i + A > ||z||oo > w. If the second inequality 
holds, then k = (\z k \ +w — \\S(z, \zk\)\\i)/2. Otherwise, we have \\S(z, |^fc|)||i > w, implying 
that \z k \/2 < A3 and we're in Case II, i.e. k = \z k \/2. □ 

Remarks: 

• In the "Big main effect" case, only small interactions are modified. In particular, the 
test statistic of these smallest interactions are made smaller. 

• In the "Moderate main effect" case, interactions below a certain threshold are modified, 
with the very smallest ones being reduced to half their size. 

• In the "Big interaction" case, the main effect statistic receives a positive boost and the 
interactions are reduced. Notice that for \z k \ = ||z||ao, we have k = 0, i.e. the largest 
interaction enters at the same time as the main effect. 



v = max < w 



Corollary 1. The and k defined as in Proposition^ are given by 

W + || ^ || op 

2 

2" 



\z k \ [w- \\S(z, l* fc l)||i] + 
v k — mm < \z k , — — I 



Proof. For the "Big main effect" and "Moderate main effect" cases, observe that \z k \ > 
\z k \/2 + [w - \\S(z, |2fc|)||i] + /2 is equivalent to \\S(z, \z k \)\\i + \z k \ > w (assuming z k ^ 0), 
which can be related directly to \z k \ < \\. Forthe "Big interaction" case, \\S(z, \z k \)\\i + \z k \ > 
\\z\\oo > w and so \z k \ > \z k \/2 + [w — \\S(z, \z k \)\\i} + /2 holds automatically. □ 

Remarks: 

• The main effect statistic is boosted when there's at least one large interaction. 

• Interaction statistics get reduced by at most a factor of 2. The size of the reduction 
for k depends only on the size of the main effect and on those interactions that are 
larger (in absolute value) than \z k \. In particular, \\S(z, |zfc|)||i = ^2k-\z k \>\z \(\ z k\ ~ \ z k\) 
measures how much larger the other interactions are compared to \z k \, and the larger 
this value, the more k is reduced. 
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B Test statistics 

In Appendix we studied what happens to the jth variable (i.e. its main effect and 
p—1 interactions). In this appendix, we use the result of Appendix |X]to obtain a closed-form 
expression for our test statistics. To do so, we return to the main paper's notation, in which 
Wj (rather than w) represents the jth main effect, Zjf. (rather than z k ) represents the jkth 
interaction, Xj (rather than v) represents the test statistic for the jth main effect and Xjk 
(rather than v^) represents the test statistics for the jkth interaction. 

Proposition 4. Problem^ has a unique solution path, ^/3 + (A), /3~(A), #(A) j e ]R 2 p+p(p-i/2) 
for A > 0. The values 

A j = sup{A>0: /3+(A)-/3r(A)^0} 
X jk = sup{A > : 6 jk (X) ^ 0} 

can be expressed in closed-form as 

t f I I \ w j\ + Halloo 1 

Xj = max 1 1^1, | 

Xjk - mm < \z jk \, H 2 f ' 

where Zj. G W' 1 is the vector of interaction contrasts involving the jth variable and Wj is 
the main effect contrast. 

Proof. This follows immediately from Corollary [TJ □ 
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